The attached logins.json file contains (simulated) timestamps of user logins in a particular geographic location. Aggregate these login counts based on 15 minute time intervals, and visualize and describe the resulting time series of login counts in ways that best characterize the underlying patterns of the demand. Please report/illustrate important features of the demand, such as daily cycles. If there are data quality issues, please report them.
The data at 15-minute resolution is too dense to visually identify any patterns, so it needs to be resampled. The data is therefore resampled daily, and the total logins per day are plotted in the figure below.

The daily plot shows that activity tends to increase toward the end of each week and then drops sharply at the start of the following week before rising again. The exception to this pattern is the third week of March 1970, where most of the daily counts are relatively high.
The figure below shows the second week of January 1970, with the data resampled hourly.

The hourly aggregation shows daily peaks of demand at noon and at midnight, and that most of the demand occurs during the weekends.
The neighboring cities of Gotham and Metropolis have complementary circadian rhythms: on weekdays, Ultimate Gotham is most active at night, and Ultimate Metropolis is most active during the day. On weekends, there is reasonable activity in both cities.
However, a toll bridge, with a two-way toll, between the two cities causes driver partners to tend to be exclusive to each city. The Ultimate managers of city operations for the two cities have proposed an experiment to encourage driver partners to be available in both cities, by reimbursing all toll costs.
1. What would you choose as the key measure of success of this experiment in encouraging driver partners to serve both cities, and why would you choose this metric? For this experiment, I would choose as the key measure of success the average number of times the toll bridge is used per driver. An increase in that number would mean that reimbursing the toll cost is a good initiative to encourage drivers to work in both cities. However, we need to take into account the complementary circadian rhythms of the two cities. So, when calculating the averages, I will compute one average for weekdays and another for weekends.
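As a rough sketch, the metric could be computed from a toll-crossing log as follows. The `crossings` table here is hypothetical (no such data is provided in the challenge); the column names are illustrative.

```python
import pandas as pd

# Hypothetical toll-crossing log: one row per bridge crossing,
# with the driver id and the timestamp of the crossing.
crossings = pd.DataFrame({
    'driver_id': [1, 1, 2, 2, 3],
    'timestamp': pd.to_datetime([
        '2014-01-06 09:00',  # Monday
        '2014-01-07 18:30',  # Tuesday
        '2014-01-11 23:00',  # Saturday
        '2014-01-12 01:15',  # Sunday
        '2014-01-08 12:00',  # Wednesday
    ]),
})

# Flag weekend crossings (Monday=0 ... Sunday=6)
crossings['is_weekend'] = crossings['timestamp'].dt.dayofweek >= 5

# Key metric: average number of crossings per driver,
# computed separately for weekdays and weekends
n_drivers = crossings['driver_id'].nunique()
avg_crossings = crossings.groupby('is_weekend').size() / n_drivers
print(avg_crossings)
```

The same computation would be run for each experimental group, so the averages can be compared across groups within each day-type segment.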
2. Describe a practical experiment you would design to compare the effectiveness of the proposed change in relation to the key measure of success. Please provide details on:
a. how you will implement the experiment
I will randomly divide the driver population into two groups, A and B. For group A, the toll costs will be reimbursed (this is the change the company wants to introduce); for group B, nothing will be reimbursed. Then, for each driver in each group, I will record the number of times the toll bridge is used. As mentioned before, weekdays will be recorded separately from weekends.
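The random assignment could look like the following sketch. The roster of driver ids is hypothetical, and the fixed seed is only there to make the assignment reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical roster of driver ids
drivers = pd.DataFrame({'driver_id': range(1000)})

# Randomly assign each driver to group A (tolls reimbursed)
# or group B (control), with equal probability
drivers['group'] = rng.choice(['A', 'B'], size=len(drivers))
print(drivers['group'].value_counts())
```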
b. what statistical test(s) you will conduct to verify the significance of the observation The experiment corresponds to an A/B test. For each population we are interested in the average number of times the toll bridge is used per day. If there is a difference between the averages of the two populations, we want to determine whether that difference is statistically significant. For this purpose I will use a z-test to evaluate the statistical significance of the difference in the averages of the two populations.
Given the difference in behavior between weekdays and weekends, I will perform the test for weekdays and weekends separately.
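The two-sample z-test on the difference in means can be sketched as below. The crossing counts are simulated here (Poisson draws with assumed rates), purely to illustrate the computation; the real inputs would be the per-driver daily crossing counts for each group in a given day-type segment.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Simulated daily crossing counts per driver:
# group A (tolls reimbursed) vs. group B (control).
# The rates are assumptions for illustration only.
a = rng.poisson(lam=1.3, size=500)
b = rng.poisson(lam=1.0, size=500)

# Two-sample z-test on the difference in means
diff = a.mean() - b.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
z = diff / se
p_value = 2 * norm.sf(abs(z))  # two-sided p-value

print(f"z = {z:.2f}, p = {p_value:.4f}")
```

With the large driver samples expected here, the normal approximation behind the z-test is reasonable; for small groups a t-test would be the safer choice.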
c. how you would interpret the results and provide recommendations to the city operations team along with any caveats. If a statistically significant increase in toll bridge usage is observed, I will recommend that the city operations team proceed with the change. However, as mentioned before, I will first determine whether there is a difference between weekdays and weekends. If there is, that information will be provided too, so that toll costs are reimbursed only on the days when bridge use actually increases. For example, it may make sense to reimburse the costs during the weekends but not during weekdays. That said, it may also be useful, when analyzing the data, to conduct the statistical test by the hour, to determine whether toll bridge usage is also time dependent, particularly on weekdays.
Ultimate is interested in predicting rider retention. To help explore this question, we have provided a sample dataset of a cohort of users who signed up for an Ultimate account in January 2014. The data was pulled several months later; we consider a user retained if they were “active” (i.e. took a trip) in the preceding 30 days.
We would like you to use this data set to help understand what factors are the best predictors for retention, and offer suggestions to operationalize those insights to help Ultimate.
The data is in the attached file ultimate_data_challenge.json. See below for a detailed description of the dataset. Please include any code you wrote for the analysis and delete the dataset when you have finished with the challenge.
● city: city this user signed up in
● phone: primary device for this user
● signup_date: date of account registration; in the form ‘YYYY MM DD’
● last_trip_date: the last time this user completed a trip; in the form ‘YYYY MM DD’
● avg_dist: the average distance in miles per trip taken in the first 30 days after signup
● avg_rating_by_driver: the rider’s average rating over all of their trips
● avg_rating_of_driver: the rider’s average rating of their drivers over all of their trips
● surge_pct: the percent of trips taken with surge multiplier > 1
● avg_surge: The average surge multiplier over all of this user’s trips
● trips_in_first_30_days: the number of trips this user took in the first 30 days after signing up
● ultimate_black_user: TRUE if the user took an Ultimate Black in their first 30 days; FALSE otherwise
● weekday_pct: the percent of the user’s trips occurring during a weekday
1. Perform any cleaning, exploratory analysis, and/or visualizations to use the provided data for this analysis (a few sentences/plots describing your approach will suffice). What fraction of the observed users were retained?
'avg_rating_by_driver', 'avg_rating_of_driver' and 'phone' have null values. However, the rows containing those null fields may still carry useful information. The average value for the ratings tends to be very good, so for 'avg_rating_by_driver' and 'avg_rating_of_driver' the null values are replaced by their mean values, which are 4.78 and 4.608 respectively. For the 'phone' column, the null values are replaced by the string 'other'. The column 'ultimate_black_user' is converted from boolean to int. 'last_trip_date' and 'signup_date' are converted to datetime type.
An inspection of the pairwise correlations between the numerical columns doesn't show a strong correlation between variables. The strongest correlation is between 'avg_surge' and 'surge_pct', with a value of 0.79. This indicates that practically all the columns can be treated as independent variables in the subsequent analysis. This is confirmed by the correlation matrix plots shown below.

The diagonal shows the distribution of values for each variable. Most of the variables have values concentrated in a narrow range, with skewed distributions. This may cause models to underperform during training, so it may be better to rescale the variables that are skewed yet cover a wide range of values. This decision will be made once the model to train has been chosen.
To calculate the fraction of retained users, given that the exact date on which the data was retrieved is not provided, the most recent date in the column 'last_trip_date' is taken as the reference for estimating user retention. Retained users are those who completed a trip in the 30 days preceding the reference date. Using this definition, the fraction of retained users is 37.61%.
2. Build a predictive model to help Ultimate determine whether or not a user will be active in their 6th month on the system. Discuss why you chose your approach, what alternatives you considered, and any concerns you have. How valid is your model? Include any key indicators of model performance.
The problem at hand is a binary classification. For this kind of problem, SVMs, logistic regression and decision trees are good options. However, we want to interpret the results in addition to having a reliable prediction. This rules out SVMs, because despite their high accuracy, their results are harder to interpret. Between logistic regression and decision trees, I decided to use a decision tree because its results are easier to interpret. With decision trees, however, it's important to be careful, because it is easy to overfit the model. So, among the tree-based models, I chose a random forest classifier, since it helps prevent overfitting by reducing variance. Since we are using a tree-based model, it is not necessary to rescale the features: during training, the model can make its splits properly with the data as it is.
In this problem we care about properly identifying the users who are still active in the 6th month after sign-up. A user active after 6 months is the positive class (1), and an inactive user is the negative class (0). Given that the data set is imbalanced, with the positive class representing only about 37% of the data, the most appropriate metric to evaluate the model is precision: the fraction of correctly detected positive targets among all the targets predicted as positive.
The best hyper-parameters for the random forest model were determined by performing several grid searches with cross-validation, in order to avoid overfitting the training data and to obtain good performance on the test data. The following metrics were obtained on the test data:
Accuracy: 78.44%
Precision for the positive class: 74.81%
3. Briefly discuss how Ultimate might leverage the insights gained from the model to improve its long-term rider retention (again, a few sentences will suffice).
The most important feature for predicting rider retention is 'avg_rating_by_driver'. So Ultimate needs to make sure that the user experience is up to expectations, perhaps even by providing drivers with some training in customer interaction.
Users from King's Landing are more likely to be retained than users from the other cities. Similarly, iPhone owners are more likely to be retained than Android users. iPhone ownership is associated with higher income, so people from King's Landing may also be wealthier than those from the other cities. To increase retention among lower-income users, special discounts could be offered to Android users and to residents of Winterfell and Astapor.
Given that the number of trips in the first 30 days plays an important role in user retention, during the first month after subscription special offers can be created to encourage service use during that period.
Finally, given that 'weekday_pct' is also an important factor in user retention, Ultimate may introduce incentives that encourage more trips per week, or even a system of points per mile traveled with the service.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
The attached logins.json file contains (simulated) timestamps of user logins in a particular geographic location. Aggregate these login counts based on 15 minute time intervals, and visualize and describe the resulting time series of login counts in ways that best characterize the underlying patterns of the demand. Please report/illustrate important features of the demand, such as daily cycles. If there are data quality issues, please report them.
# load json as string
# json.load((open('./logins.json')))
logins_df = pd.read_json('./logins.json')
logins_df.info()
logins_df.head()
logins_df.describe()
# Set time stamps as index
logins_df.index = logins_df.login_time
# Aggregate time every 15 minutes
login_agg = logins_df.resample('15min').count()
login_agg.columns = ['15_min_count']
print(login_agg.head())
# Aggregation sum must be equal to len of original data frame
login_agg['15_min_count'].sum() == len(logins_df)
fig = login_agg.plot(figsize=(20, 10), fontsize=20)
fig.legend(["15 minutes count"])
_ = plt.xlabel("Login Time", size=20)
_ = plt.ylabel("15 Minutes Count", size=20)
_ = plt.show()
The data is too dense; it needs to be resampled to properly identify any patterns visually. It may be helpful to resample the data daily and plot the total logins per day.
# Resample to daily totals (the variable name reflects the daily aggregation)
daily = login_agg.resample("D").sum()
_ = daily.plot(figsize=(20, 10), fontsize=20)
_ = plt.xlabel("Login Time", size=20)
_ = plt.ylabel("Daily Count", size=20)
_ = plt.show()
The daily plot shows that activity tends to increase toward the end of each week and then drops sharply at the start of the following week before rising again. The exception to this pattern is the third week of March 1970, where most of the daily counts are relatively high.
Let's now look at the daily cycle by resampling the data hourly and choosing a sample week.
hourly = login_agg.resample("H").sum()
_ = hourly[74:288].plot(figsize=(20, 10), fontsize=20)
_ = plt.xlabel("Login Time", size=20)
_ = plt.ylabel("Hourly Count", size=20)
_ = plt.show()
The hourly aggregation shows daily peaks of demand at noon and at midnight, and that most of the demand occurs during the weekends.
1. Perform any cleaning, exploratory analysis, and/or visualizations to use the provided data for this analysis (a few sentences/plots describing your approach will suffice). What fraction of the observed users were retained?
# ultimate_df = pd.read_json('./ultimate_data_challenge.json')
df = pd.DataFrame(json.load(open('./ultimate_data_challenge.json')))
df.head()
df.describe()
df.info()
'avg_rating_by_driver', 'avg_rating_of_driver' and 'phone' have null values. However, the rows containing those null fields may still carry useful information. The average value for the ratings tends to be very good, so for 'avg_rating_by_driver' and 'avg_rating_of_driver' the null values are replaced by their mean values, which are 4.78 and 4.608 respectively. For the 'phone' column, the null values are replaced by the string 'other'. The column 'ultimate_black_user' is converted from boolean to int. 'last_trip_date' and 'signup_date' are converted to datetime type.
# Copy df to start modifying data set
data = df.copy(deep=True)
# Fill null values
data['avg_rating_by_driver'] = data['avg_rating_by_driver'].fillna(data['avg_rating_by_driver'].mean())
data['avg_rating_of_driver'] = data['avg_rating_of_driver'].fillna(data['avg_rating_of_driver'].mean())
data['phone'] = data['phone'].fillna('other')
# Change boolean to int
data.ultimate_black_user = data.ultimate_black_user.astype(int)
# Turn time columns to datetime format
data.last_trip_date = pd.to_datetime(data.last_trip_date)
data.signup_date = pd.to_datetime(data.signup_date)
data.info()
data.corr(numeric_only=True)
An inspection of the pairwise correlations between the numerical columns doesn't show a strong correlation between variables. The strongest correlation is between 'avg_surge' and 'surge_pct', with a value of 0.79. This indicates that practically all the columns can be treated as independent variables in the subsequent analysis.
Pairwise correlation plot.
sns.pairplot(data, diag_kind="kde")
plt.show()
The diagonal shows the distribution of values for each variable. Most of the variables have values concentrated in a narrow range, with skewed distributions. This may cause models to underperform during training, so it may be better to rescale the variables that are skewed yet cover a wide range of values. This decision will be made once the model to train has been chosen.
To calculate the fraction of retained users, given that the exact date on which the data was retrieved is not provided, the most recent date in the column 'last_trip_date' is taken as the reference for estimating user retention. Retained users are those who completed a trip in the 30 days preceding the reference date.
ref_date = data.last_trip_date.max()
# Fraction of retained users
fraction = np.sum((ref_date - data.last_trip_date) <= pd.Timedelta('30 days')) / len(data) * 100
print("Fraction of retained users: %0.2f%%"%(fraction))
2. Build a predictive model to help Ultimate determine whether or not a user will be active in their 6th month on the system. Discuss why you chose your approach, what alternatives you considered, and any concerns you have. How valid is your model? Include any key indicators of model performance.
Using one-hot encoding for categorical columns
model_data = pd.get_dummies(data)
model_data.head()
Defining targets and features set
# Features
X = model_data[['avg_dist', 'avg_rating_by_driver', 'avg_rating_of_driver', 'avg_surge',
                'surge_pct', 'trips_in_first_30_days', 'ultimate_black_user', 'weekday_pct',
                'city_Astapor', "city_King's Landing", 'city_Winterfell',
                'phone_Android', 'phone_iPhone', 'phone_other']]
# Targets
y = ((ref_date - data.last_trip_date) <= pd.Timedelta('30 days')).astype(int)
Develop model
The problem at hand is a binary classification. For this kind of problem, SVMs, logistic regression and decision trees are good options. However, we want to interpret the results in addition to having a reliable prediction. This rules out SVMs, because despite their high accuracy, their results are harder to interpret. Between logistic regression and decision trees, I decided to use a decision tree because its results are easier to interpret. With decision trees, however, it's important to be careful, because it is easy to overfit the model. So, among the tree-based models, I chose a random forest classifier, since it helps prevent overfitting by reducing variance.
In this problem we care about properly identifying the users who are still active in the 6th month after sign-up. A user active after 6 months is the positive class (1), and an inactive user is the negative class (0). Given that the data set is imbalanced, with the positive class representing only about 37% of the data, the most appropriate metric to evaluate the model is precision: the fraction of correctly detected positive targets among all the targets predicted as positive.
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.ensemble import RandomForestClassifier
from time import time
# Definition of random seed
seed = 14
np.random.seed(seed)
# Divide data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=seed, shuffle=True, stratify=y)
# Model
model = RandomForestClassifier(random_state=seed, n_jobs=-1)
# Parameters for grid search
parameters = {'n_estimators': [16, 20, 25],
              'max_depth': [75, 80],
              'max_features': ['sqrt'],
              'min_samples_leaf': [18, 19],
              'min_samples_split': [10, 13, 15]}
# Defining cross validation object
cv = KFold(n_splits=5, shuffle=True, random_state=seed)
# Define grid search
grid_search = GridSearchCV(model, param_grid=parameters, cv=cv, scoring='precision')
# Perform grid search for best parameters
start = time()
grid_search.fit(X_train, y_train)
end = time()
print("Grid-search time: %.3f" %(end-start))
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
from sklearn.metrics import precision_score
# Model
model = RandomForestClassifier(max_depth=75,
                               max_features='sqrt',
                               min_samples_leaf=18,
                               min_samples_split=10,
                               n_estimators=25,
                               random_state=seed)
# Fit model
model.fit(X_train, y_train)
# Predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Metrics
precision_train = precision_score(y_train, y_train_pred)
precision_test = precision_score(y_test, y_test_pred)
print("Precision on training data: \n\t{}".format(precision_train))
print("Precision on test data: \n\t{}".format(precision_test))
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import roc_curve, auc
confusion = confusion_matrix(y_test, y_test_pred)
accuracy = accuracy_score(y_test, y_test_pred)
precision, recall, f1_score, support = precision_recall_fscore_support(y_test, y_test_pred, beta=1)
# Predict probabilities, the second element of the predictions
# contains the probability of having a positive
y_test_prob = [prob[1] for prob in model.predict_proba(X_test)]
fpr, tpr, thresholds = roc_curve(y_test, y_test_prob)
auc_roc = auc(fpr, tpr)
# Plot ROC
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Random Forest')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
print("Area under ROC: {:.4f}".format(auc_roc))
print("Confusion matrix:\n {}".format(confusion))
print("Accuracy: {:.2f}%".format(accuracy*100))
print("Precision for 0 and 1: {:.2f}% and {:.2f}%".format(precision[0]*100, precision[1]*100))
print("Recall for 0 and 1: {:.2f}% and {:.2f}%".format(recall[0]*100, recall[1]*100))
print("F1 score for 0 and 1: {:.2f}% and {:.2f}%".format(f1_score[0]*100, f1_score[1]*100))
3. Briefly discuss how Ultimate might leverage the insights gained from the model to improve its long term rider retention (again, a few sentences will suffice).
Let's interpret the result by analyzing the feature importances and visualizing one of the trees.
feature_importance = model.feature_importances_
features = X.columns
print("Feature importance:")
for i in range(len(features)):
    print("\t X_{} -- {}: {:.5f}".format(i, features[i], feature_importance[i]))
from io import StringIO  # sklearn.externals.six was removed from recent scikit-learn versions
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
dot_data = StringIO()
export_graphviz(model.estimators_[24], out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())